primitive parallel operation


Evan  R.  Cohn*  Ramsey  W.  Haddad 

Stanford  University  Stanford  University 


Abstract 

We will consider the primitive parallel operation of the Connection
Machine, the Beta Operation. Let the input size of the problem be N
and the output size M. We will show how to perform the Beta Operation
on an N-node hypercube in $O(\log N + \log^2 M)$ time. For a
$\sqrt{N} \times \sqrt{N}$ mesh-of-trees, we require
$O(\log N + \sqrt{M})$ time.


1  Introduction 


The ever decreasing cost of computer processors has created a great interest
in multi-processor computers. However, along with the increased power that
this parallelism brings comes increased complexity in programming.

One approach to lessening this complexity is to provide the programmer
with general purpose parallel primitives that shield him from the structure of
the underlying machine. In The Connection Machine [Hi85], Hillis suggests
the Beta Operation as a parallel primitive for his hypercube-based machine.
In this paper we shall explore efficient ways to perform this operator on
several different well known architectures, including the hypercube. We then
present some lower bounds associated with the problem.




1.1  The  Beta  Operation 


For a two-argument function, F, and an array of values, $C = [c_0, \ldots, c_m]$, let
us define the F-reduction of C as the natural (APL style) reduction, except
with no specified evaluation order. That is


• $F/[c_0] = c_0$ (a single-element array reduces to its element), and


*Work  supported  by  ONR  grant  N00014-85-C-0731 . 




• $F/[c_0, \ldots, c_m] = F(F/[c_{\pi(0)}, \ldots, c_{\pi(i)}],\ F/[c_{\pi(i+1)}, \ldots, c_{\pi(m)}])$ for some $0 \le i < m$, and some permutation $\pi$.

We are given an F, and N pairs, $(g_0, v_0), \ldots, (g_{N-1}, v_{N-1})$, as input to
a Beta Operation. The $g_j$'s should be thought of as group numbers and the
$v_j$'s as data. Let us call the collection of (g, v)-pairs with the value g = i the
i-block. Occasionally, we will also use the term i-block to refer to the set of
processors holding the (g, v)-pairs of an i-block when no misunderstanding
can result. Let $G = \{i \mid \exists j,\ g_j = i\}$ (that is, G is the set of the i's such
that block i has at least one element). If $s_i$ is the array of v-values from the
i-block then a Beta Operation computes the values $y_i = F/s_i$, for $i \in G$.

Note  that  with  our  definition  of  the  F-reduction  of  a  block,  performing 
a  Beta  Operation  is  ill-defined  unless  F  is  commutative  and  associative. 
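As a point of reference, the input/output behaviour defined above can be sketched sequentially. This is only an illustration of the semantics, not of any parallel algorithm in this paper; the function name `beta_operation` and the sample data are assumptions made for the example, and F is assumed commutative and associative as required above.

```python
from functools import reduce

def beta_operation(pairs, f):
    """Sequential reference for the Beta Operation (illustrative only).

    pairs: iterable of (g, v) with g a group number and v a data value.
    f:     a commutative, associative two-argument function.
    Returns the (i, y_i) pairs for i in G, sorted by i.
    """
    blocks = {}
    for g, v in pairs:
        blocks.setdefault(g, []).append(v)   # collect the i-blocks
    # y_i is the F-reduction of the v-values in the i-block
    return [(g, reduce(f, blocks[g])) for g in sorted(blocks)]

pairs = [(2, 5), (0, 1), (2, 7), (1, 4), (0, 3)]
sums = beta_operation(pairs, lambda a, b: a + b)
maxima = beta_operation(pairs, max)
```

The sorted list returned here corresponds to the first output formulation described below, in which the non-trivial pairs end up in order in the lowest numbered processors.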

There  are  two  slightly  different  formulations  of  the  Beta  Operation.  In 
both  formulations,  each  processor  j  initially  contains  a  pair,  (gj,vj).  In 
the first formulation, the |G| non-trivial $(i, y_i)$ pairs end up in sorted order
(according to i) in the first |G| processors. In the second formulation, at the
end of the operation, each $y_i$ appears at processor i.

This difference is generally unimportant since output of the first type
can be converted to that of the second type with a single monotone
routing.  For  all  the  networks  that  we’ll  consider,  the  time  to  perform  a  Beta 
Operation  will  dominate  the  time  of  a  monotone  routing. 

2  Hypercube 

We focus first on N-processor hypercube systems, where there is a known
bound on |G|. We shall discuss the necessary modifications for the case
when |G| is unknown in Section 4. For simplicity, we shall assume that |G|
is a power of 2. It is true that $2^{t-1} < |G| \le 2^t$ for some t. If we assume
that |G| is really $2^t$, the algorithm will work with the same asymptotic time
complexity. Let $N = 2^n$ and $|G| = 2^q$.¹

2.1  The  Generic  Step 

In each step of this algorithm, we will conceptually break the hypercube
into smaller hypercubes. We then perform the Beta Operation on all of the
subcubes in parallel by applying the following sequence of subroutines.

Sort. We sort the (g, v)-pairs in the subcube by g-value.

Reduce. For each distinct g, we combine the pairs with that g into a single
(g, v)-pair, by applying the function F to the associated v-values.

Compact. We route the resulting (g, v)-pairs ($\le |G|$ of them) into the
lowest numbered processors of the subcube, retaining their sorted-by-g
order.

¹ The special cases where q doesn't divide n evenly or $|G|^3 > N$ can be treated by
trivially modifying the algorithm given.

We will organize the algorithm such that at the end of step i in Phase
2, there will be (g, v)-pairs only in the $N/|G|^{i+2}$ processors with binary
representations:

$$\underbrace{0 \cdots 0}_{(i+2)q}\ \underbrace{* \cdots *}_{n-(i+2)q}$$

By the end of the last step ($i = n/q - 3$) of this phase, we will have performed
the Beta Operation on the whole hypercube.


2.2  Phase  1 

We break the N processors into $N/|G|^3$ hypercubes of $|G|^3$ nodes each such
that hypercube j has binary representation:

$$\underbrace{* \cdots *}_{3q}\ \underbrace{j}_{n-3q}$$

For  each  hypercube  we  perform  the  following: 

Sort. We use the odd-even merge sort to sort by g-value.

Reduce. Using $O(\log|G|)$ distribution-from-leaders [U84], we can combine
the $|G|^3$ (g, v)-pairs into one (g, v)-pair per distinct g. Since this reduction
takes the same time as the above sorting subroutine, $O(\log^2 |G|)$,
it suffices for asymptotic analysis. Nevertheless, for various reasons
that will become clear later, it is important to decrease the time taken
by this step to $O(\log|G|)$. The reduction can be done efficiently as
follows.

2.2.1 Efficient Hypercube Reduction

Let us call the largest hypercube contained in an i-block the central
block (CB). (If an i-block has two largest hypercubes that can't be
merged because their addresses are of the form $(k) \cdot *^j$ and $(k+1) \cdot *^j$,
then always choose the lower numbered one.) We show in Lemma 2.1
below that the CB for an i-block of size s, $2^{j+2} - 1 > s > 2^{j+1} - 2$,
must be of size $2^j$ or $2^{j+1}$. The reduction takes 3 steps.

Step 1. All the processors in a particular i-block determine if they
are part of the CB for that i-block or not.

Step 2. All the processors, $p_k$, not in the CB for their i-block, send
their (g, v)-pairs to either $p_{k+|CB|}$, $p_{k-|CB|}$, or $p_{k-2|CB|}$, depending
on which of these is an address in their i-block's CB.

Step 3. In parallel, each of the CB's reduces all of its (g, v)-pairs to
a single value.

In Step 1, each processor checks the two processors on either side to
determine if it is the first or last processor in its i-block. Then, with
two distribution-from-leaders, each processor can be told the numbers
of the first and last processors in its i-block. Using this information a
processor can determine if it is part of the CB in constant time. Step
2 is executed as follows. There are at most $|CB| - 1$ elements in the
i-block before the first element in the CB. If this were not the case then
the first |CB| elements would constitute the CB. There are at most
$2 \cdot |CB| - 1$ elements after the CB, or the CB would be twice as big.
The processors that are not part of the CB can send their pairs over
to the appropriate processors of the CB with three monotone routing
steps. Step 3 is straightforward. The total time for all three steps is
O(log N).
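The central-block rule can be illustrated sequentially. The sketch below is an assumption-laden illustration, not the parallel Step 1: it treats an i-block as a contiguous range of processor numbers [lo, hi] (which holds after sorting), treats a sub-hypercube as an aligned power-of-two range of addresses, and uses the hypothetical name `central_block`.

```python
def central_block(lo, hi):
    """Return (start, size) of the central block of the i-block [lo, hi].

    The central block is taken to be the largest aligned range
    [k * size, (k + 1) * size - 1] with size a power of two that lies
    inside [lo, hi]; ties go to the lowest numbered candidate.
    """
    size = 1
    while size * 2 <= hi - lo + 1:       # largest power of two <= block length
        size *= 2
    while True:
        start = -(-lo // size) * size    # smallest multiple of size that is >= lo
        if start + size - 1 <= hi:
            return start, size
        size //= 2                       # size 1 always fits, so this terminates

example = central_block(1, 14)           # an i-block of 14 processors
```

For the block occupying processors 1 through 14, the sketch selects the range starting at processor 4 with 4 processors, leaving 3 elements before it and 7 after it, which matches the bounds $|CB| - 1$ and $2 \cdot |CB| - 1$ used in Step 2.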

Lemma 2.1 The CB for an i-block of size s, $2^{j+2} - 1 > s > 2^{j+1} - 2$,
must be of size $2^j$ or $2^{j+1}$.

Proof: Clearly |CB| can be no bigger than $2^{j+1}$.

Assume that we have a CB of size $2^h$, where $0 \le h \le j - 1$, with
addresses of the form $(k) \cdot *^h$, with k even. The element at location
$(k-1) \cdot 0^h$ is not in the i-block, because then the hypercube starting
at this address would be the CB. The element at location $(k+1) \cdot 1^h$ is
not in the i-block, because otherwise concatenating the blocks $(k) \cdot *^h$
and $(k+1) \cdot *^h$ would give us a CB of size $2^{h+1}$. Thus $s \le 3 \cdot 2^h - 2$,
which is a contradiction for all h's in the range specified.

Similarly, we get a contradiction if we assume that we have a CB of
size $2^h$ with addresses of the form $(k) \cdot *^h$, with k odd. ∎

Compact. Consider the (g, v)-pairs left by the reduce stage. By means
of a prefix operation we can compute, for each (g, v)-pair, how many
(g, v)-pairs are in lower numbered processors. Then we can compact
via a monotone routing.
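The prefix-count idea behind the compact step can be sketched sequentially. This is an illustration under assumptions: processors are modelled as list slots, `None` marks a processor holding no pair, and `compact` is a hypothetical name; on the hypercube the counts would come from a parallel prefix operation and the moves from a monotone routing.

```python
from itertools import accumulate

def compact(slots):
    """Move the surviving pairs to the lowest numbered slots, keeping order."""
    flags = [0 if pair is None else 1 for pair in slots]
    inclusive = list(accumulate(flags))      # running count of surviving pairs
    out = [None] * len(slots)
    for k, pair in enumerate(slots):
        if pair is not None:
            # inclusive[k] - 1 pairs sit in lower numbered processors,
            # so that count is this pair's destination address.
            out[inclusive[k] - 1] = pair
    return out

before = [None, (0, 3), None, None, (1, 4), (2, 7), None]
after = compact(before)
```

Because the destination addresses increase with the source addresses, no two pairs cross, which is what makes the routing monotone.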

2.3  Phase  2 

Steps i = 1 through i = n/q - 3:

We break the $N/|G|^{i-1}$ lowest numbered processors into $N/|G|^{i+3}$
hypercubes of $|G|^4$ nodes each, such that hypercube j has binary representation:

$$\underbrace{0 \cdots 0}_{(i-1)q}\ \underbrace{* \cdots *}_{4q}\ \underbrace{j}_{n-(i+3)q}$$

Note that with this choice of partitioning the hypercube, each subcube has
only $|G|^2$ (g, v)-pairs. This is because before Step i only the processors with
addresses of the form:

$$\underbrace{0 \cdots 0}_{(i+1)q}\ \underbrace{* \cdots *}_{2q}\ \underbrace{j}_{n-(i+3)q}$$

contain (g, v)-pairs. For each subcube, perform the following:

Sort. We can use the Nassimi-Sahni sort [NS82] to sort the $|G|^2$ (g, v)-pairs
by g-value.

Reduce. It is easiest to view the $|G|^4$ node hypercube as a $|G|^2 \times |G|^2$
matrix, $p_{ij}$, with the processors arrayed in order of increasing processor
number (in row-major form). Initially, only the first row contains
(g, v)-pairs. Using a prefix operation, we can determine which processors
are the leftmost processors in their i-block. Call these the leaders.
We start by broadcasting the contents of each first row processor, $p_{1j}$,
to the column j. Then each processor, $p_{jj}$, broadcasts to row j. Finally,
in the columns of the leaders, F is applied to those v-values
whose corresponding g-values match the leader's.
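The two broadcasts in this reduce step can be simulated on a small grid. The sketch below is a sequential illustration under assumptions: the first row is passed in as a Python list already sorted by g, the grid is square with one row per first-row processor, and `matrix_reduce` is a hypothetical name.

```python
from functools import reduce

def matrix_reduce(first_row, f):
    """Simulate the column and row broadcasts, then reduce in leader columns."""
    k = len(first_row)
    # Column broadcast: every processor in column c receives first_row[c].
    grid = [[first_row[c] for c in range(k)] for _ in range(k)]
    # Row broadcast: the diagonal processor (j, j) sends its pair along row j,
    # so every column now holds one copy of each original pair.
    grid = [[grid[r][r] for _ in range(k)] for r in range(k)]
    results = {}
    for c in range(k):
        g = first_row[c][0]
        is_leader = c == 0 or first_row[c - 1][0] != g   # leftmost of its i-block
        if is_leader:
            matching = [grid[r][c][1] for r in range(k) if grid[r][c][0] == g]
            results[g] = reduce(f, matching)
    return results

row = [(0, 1), (0, 2), (1, 5), (2, 3), (2, 4), (2, 1)]
totals = matrix_reduce(row, lambda a, b: a + b)
```

In the simulation each leader column ends up holding every pair of its own block, so a single column-wise reduction produces the block's value.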


Compact. As in Phase 1, consider the (g, v)-pairs left by the reduce stage.
Move all of these pairs to the first row. All the processors that don't
contain one of these pairs set their g-value to infinity. We can then
compact by sorting on g-value using the Nassimi-Sahni sort.

2.4  Time  Analysis 

The sort step of Phase 1 takes $O(\log^2 |G|)$ time. The reduce and compact
subroutines of Phase 1 both take $O(\log |G|)$ time. In every step of Phase 2,
each of these subroutines takes $O(\log |G|)$ time. There are $O(\log N / \log |G|)$
such steps so Phase 2 takes time $O(\log N)$. Thus, the overall time for the
algorithm is $O(\log N + \log^2 |G|)$.
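Written out with $n = \log N$ and $q = \log |G|$ as defined in Section 2, the accounting above amounts to the following short calculation (a restatement of the bounds just given, not an additional result):

```latex
\begin{align*}
T_{\text{Phase 1}} &= O(\log^2 |G|) + O(\log |G|) + O(\log |G|) = O(q^2), \\
T_{\text{Phase 2}} &= \left(\tfrac{n}{q} - 3\right) \cdot O(q) = O(n), \\
T_{\text{total}}   &= O(n + q^2) = O(\log N + \log^2 |G|).
\end{align*}
```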

3  Mesh-of- Trees 

We first note that the Beta Operation can be performed easily in time
$O(\sqrt{N})$ on a $\sqrt{N} \times \sqrt{N}$, N-processor mesh system, even if |G| is not known
beforehand. This upper bound is tight since there is an obvious lower bound
of $\Omega(\sqrt{N})$ time even when |G| is given. In the case of mesh-of-trees (MOT)
our results are for the $\sqrt{N} \times \sqrt{N}$, O(N) processor MOT system where there
is a known bound on |G|.²

3.1  The  Generic  Step 

The generic steps in Phases 2 and 3 will be essentially the same as the
generic step of the hypercube algorithm. The essential difference is that
the size of the sub-MOT's we work on will grow each step. Remember that
in the hypercube, for the steps in Phase 2, we were always working with
sub-hypercubes of a single size ($|G|^4$ nodes each). Another minor difference
is that each sort-reduce-compact step is preceded by a routing step.

We start Phase 1 by performing the Beta Operation on sub-MOT's with
side $\sqrt{4|G|}$. In Phase 2 we increase the size of the sub-MOT's considered
until the number of processors is equal to the square of the number of remaining
(g, v)-pairs in each sub-MOT. In Phase 3 we can then quickly increase the
sub-MOT size to $\sqrt{N} \times \sqrt{N}$.

² As was the case with the hypercube, we shall disregard the special cases when divisions,
square roots and logarithms produce non-integral values. Although these cases present no
special problems, dealing with them introduces needless clutter.


3.2  Phase  1 


Break the N processors into $N/(4|G|)$ sub-MOT's with side $\sqrt{4|G|}$. We can
perform the Beta Operation on these sub-MOT's using just the mesh connections.
Note that this can be done without knowing |G| beforehand. We
simply sort on the g-values and reduce the resulting i-blocks to single values.

3.3  Phase  2 

Steps i = 1 through 3q/2 - 1:

Route. Immediately before Step i, the MOT is divided into $N/(4^i|G|)$
sub-MOT's with sides of length $\sqrt{4^i|G|}$. The first $\lceil\sqrt{|G|/4^i}\rceil$ rows of each
such sub-MOT contain the $\le |G|$ different (g, v)-pairs, compacted to
the left. For convenience, these initial rows of the sub-MOT shall
henceforth be called the non-trivial-part (NTP). We start step i by
conceptually clumping 4 contiguous sub-MOT's into a single square
sub-MOT with twice the side length. We first shift up the NTP's of
the two lower blocks so that they are contiguous to the NTP's of the
upper blocks. This results in a sub-MOT with side $\sqrt{4^{i+1}|G|}$ having
an NTP occupying the first $2\lceil\sqrt{|G|/4^i}\rceil$ rows.

Sort. We can then sort this new NTP using the odd-even merge sort outlined
in Theorems 3.2 and 3.1.

Reduce. For each group number there are up to four different (g, v)-pairs.
We can combine these to produce one (g, v)-pair in $O(\log|G|)$.

Compact. All processors not holding one of these pairs set their g-values
to infinity. We then sort again on group number so that the NTP is
compacted in the first |G| spaces (in the row-major sense).

At the end of this phase, we have $N/|G|^4$ sub-MOTs of side $|G|^2$, each
with no more than |G| non-trivial (g, v)-pairs.

3.4  Phase  3 

Steps i = 1 through $\log(n/2q - 1)$:

In each step i we will increase the side of the sub-MOT from $|G|^{2^{i-1}+1}$
to $|G|^{2^i+1}$. The last step will leave us a single MOT with side $\sqrt{N}$. Notice
that in each sub-MOT, the number of processors will always be equal to the
square of the number of non-trivial (g, v)-pairs.

The  route-sort-reduce-compact  stages  are  performed  in  each  sub-MOT 
as  follows: 

Route. At the beginning of step i we have sub-MOT's of side $|G|^{2^{i-1}+1}$,
each with no more than |G| non-trivial (g, v)-pairs. We will conceptually
clump $|G|^{2^i}$ of these sub-MOT's into sub-MOT's of side $|G|^{2^i+1}$.
Consider each such sub-MOT of side $|G|^{2^i+1}$ as being composed of
$|G|^{2^{i-1}}$ columns of width $|G|^{2^{i-1}+1}$. In the routing step we move the
NTP's from all the sub-MOT's in each column into the controllers³ in
that column as a preliminary to sorting.

Sort. For each clump we have a sub-MOT of $|G|^{2^{i+1}+2}$ processors and
$|G|^{2^i+1}$ non-trivial (g, v)-pairs. Thus we can sort within each clump
using the standard MOT algorithm [U84].

Reduce. The reduce step looks very much like the standard MOT sorting
algorithm. First, every controller checks to see if the group number it
contains is the leftmost such group number. As with the hypercube
algorithm we shall call such processors leaders. Next, each controller
broadcasts its value to its entire row and column. Finally, in the
columns of the leaders, F is applied to the v-values whose corresponding
g-values match that of the column's leader.

Compact. Another sort will then compact the values. It is assumed as
always that non-leader controllers have infinite g-values.

3.5  Timing 

We  use  the  following  theorems  in  analyzing  the  time  required  to  perform 
the  algorithm  for  the  Beta  Operation  on  the  MOT. 

Lemma 3.1 An arbitrary partial permutation routing of s elements that
start and end on the leaves of a complete binary tree with m leaves can be
performed in time $O(s + \log m)$.

Proof: Let $S_{lr}$ be the elements that need to be routed from the left
half of the tree to the right half. Similarly define $S_{rl}$, $S_{ll}$ and $S_{rr}$. We
can pipe the elements of $S_{lr}$ to their destinations in time $O(|S_{lr}| + \log m)$.
Similarly, the elements of $S_{rl}$ can be routed in time $O(|S_{rl}| + \log m)$. To
route the elements of $S_{ll}$, we actually break it into two consecutive routings.
In the first, the elements are routed from their starting locations in the left
half of the tree to locations on the right, and then in the second they are
routed from the right half back to their destinations on the left. This takes
time $O(|S_{ll}| + \log m)$. Similarly, the elements in $S_{rr}$ can be routed in time
$O(|S_{rr}| + \log m)$. Since $s = |S_{rl}| + |S_{lr}| + |S_{ll}| + |S_{rr}|$, the overall routing can
be done in time $O(s + \log m)$. ∎

³ We follow the lead of Ullman [U84] in viewing the root of the ith column tree and ith
row tree as being a single node. We shall refer to this node as the ith controller.

Lemma 3.2 Given an MOT of side m with all elements contained in the
first s rows, in time $O(s + \log m)$ we can achieve any permutation in which
the elements' final destinations are also within the first s rows.

Proof: Let $R_{i,j}$ be the row of the destination of the element that starts
in row i, column j; similarly, $C_{i,j}$ is the column of the destination. We apply
Lemma 3.1 three times. It's applied first to the columns, then to the rows
and then to the columns of the MOT such that each element from (i, j)
follows the permutations: $(i, j) \to (i + j \bmod m,\ j) \to (i + j \bmod m,\ C_{i,j}) \to (R_{i,j},\ C_{i,j})$.
Each of these three permutation operations can be performed
in $O(s + \log m)$ time, yielding the desired result. ∎

Theorem 3.1 Consider a MOT with side m. Assume that it is divided
vertically into two halves and that the first s ($1 \le s \le m$) rows on the left
side contain the sorted list, A, and the first s rows on the right side contain
the sorted list, B. Then we can merge these two lists, with the results ending
up in the first s rows, in time $T(s, m) = O(s + \log m \log 2s)$.

Proof: This can be done with odd-even merge.

Step 1. In step 1 we separate out the odd-position A's ($A_{odd}$) from the
even-position A's ($A_{even}$) so that $A_{odd}$ occupies the first s/2 rows and
$A_{even}$ occupies the s/2 rows starting at row m/2. This can be done in
the manner of Lemma 3.2 in time $c_1(s + \log m)$. Simultaneously, we
separate the B's.

Step 2. In step 2 we exchange the positions of $A_{odd}$ and $B_{even}$. This can
also be done in time $c_2(s + \log m)$.

Step 3. We now want to merge lists that are stacked vertically. Consider
the $m/2 \times m/2$ square in the upper left. We separate out the even-position
$B_{even}$ ($B_{even,even}$) and the odd-position $B_{even}$ ($B_{even,odd}$). The
even positioned $B_{even}$'s go on the left and the odd ones on the right.
Simultaneously, separate the $B_{odd}$'s, the $A_{even}$'s and the $A_{odd}$'s. This
can be done in time $c_3(s + \log m)$.

Step 4. We exchange $B_{even,even}$ and $A_{even,odd}$. Simultaneously, exchange
$B_{odd,odd}$ and $A_{odd,even}$. This can be done in time $c_4(s + \log m)$.

Step 5. We now have 4 sub-MOTs with side m/2 and s/2 rows. Recursively
merge these in time T(s/2, m/2), yielding the four lists $AB_{even,even}$,
$AB_{even,odd}$, $AB_{odd,even}$, and $AB_{odd,odd}$.

Step 6. We interleave $AB_{even,even}$ with $AB_{even,odd}$. By merely swapping
adjacent list elements we are left with the sorted list $AB_{even}$. Simultaneously,
we can interleave $AB_{odd,even}$ with $AB_{odd,odd}$, yielding the sorted
list $AB_{odd}$ after the value swapping. Lastly, interleave $AB_{even}$ with
$AB_{odd}$ and do any needed value swapping. This can be done in time
$c_5(s + \log m)$.

Basis. If s = 1 then we can sort in time $O(\log m)$.

Induction step. Let $s = s_0$. Let $c_6 = c_1 + c_2 + c_3 + c_4 + c_5$. Then
$T(s_0, m) \le T(s_0/2, m/2) + c_6(s_0 + \log m)$. Thus $T(s, m) = O(s + \log m \log 2s)$. ∎
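The split, recurse, interleave and swap pattern used in this proof is the one behind Batcher's odd-even merge. The sketch below is a sequential version of the textbook formulation, offered only as an illustration: it does not model the MOT layout or the particular A/B exchanges of Steps 2 and 4, it assumes both inputs are sorted lists of the same power-of-two length, and the name `odd_even_merge` is hypothetical.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length (Batcher's scheme)."""
    n = len(a)
    if n == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Recursively merge the elements at positions 0, 2, 4, ... of each list,
    # and separately those at positions 1, 3, 5, ...
    c = odd_even_merge(a[0::2], b[0::2])
    d = odd_even_merge(a[1::2], b[1::2])
    # Interleave the two results; only neighbouring pairs (d[i], c[i + 1])
    # can be out of order, so one compare-exchange per pair finishes the merge.
    out = [c[0]]
    for i in range(n - 1):
        low, high = sorted((d[i], c[i + 1]))
        out += [low, high]
    out.append(d[n - 1])
    return out

merged = odd_even_merge([1, 4, 6, 9], [2, 3, 7, 8])
```

The final pass touches each adjacent pair once, which mirrors the "merely swapping adjacent list elements" remark in Step 6.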

Theorem 3.2 Consider a MOT with side m. Assume that there are O(ms)
numbers in the first s rows. We can sort this list with the results ending up
in the first s rows, in time $T(s, m) = O(s + \log m \log^2 2s)$.

Proof: We use a merge sort. First divide the MOT into four sub-MOTs
of side m/2. Using routings of the type in Lemma 3.1, distribute
the numbers into the first s/2 rows of each sub-MOT. Recursively sort in
these sub-MOTs in time T(s/2, m/2). We then merge the four sorted lists
together using the methods outlined in Theorem 3.1. Hence, $T(s, m) =
T(s/2, m/2) + c(s + \log m \log 2s)$. Solving this recurrence yields the time
bound claimed above. ∎
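One way to see how the recurrence in this proof gives the stated bound is to unroll it. The calculation below is a sketch under assumptions that the proof leaves implicit: s is a power of two with $s \le m$, and the base case satisfies $T(1, m') = O(\log m')$.

```latex
\begin{align*}
T(s,m) &\le T\!\left(1, \tfrac{m}{s}\right)
          + c \sum_{k=0}^{\log s - 1}
            \left( \frac{s}{2^k} + \log\frac{m}{2^k}\,\log\frac{2s}{2^k} \right) \\
       &\le O(\log m) + c\left( 2s + \log m \sum_{k=0}^{\log s - 1} \log 2s \right) \\
       &\le O(\log m) + c\left( 2s + \log m \,\log^2 2s \right) \\
       &= O(s + \log m \log^2 2s).
\end{align*}
```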
The first phase will take time $O(\sqrt{|G|})$. For step i of the second phase,
the time is determined by the sorting. By application of Theorem 3.2 we
can see that the second phase will take time

$$O\left(\sum_{i=1}^{3q/2-1} \left( 2\left\lceil\sqrt{|G|/4^i}\right\rceil + \log|G| \, \log^2 2\left\lceil\sqrt{|G|/4^i}\right\rceil \right)\right) = O(\sqrt{|G|} + \log^4 |G|) = O(\sqrt{|G|}).$$

The third phase will take time $O(\sum_{i=1}^{\log(n/2q-1)} \log |G|^{2^i+1}) = O(n)$. Thus the
overall time is $O(\log N + \sqrt{|G|})$.


4  Determining  the  Output  Size 


The time taken by the algorithms given above is a function of both the
input size, N, and the output size, |G| (for convenience let M = |G|). The
algorithms that are given assume that M is known. Thus the question arises,
what do we do if we don't already know M?

For  a  large  class  of  problems,  and  Beta  Operations  appear  to  be  one 
of  them,  the  problem  of  determining  the  output  size,  M,  is  essentially  as 
complex  as  the  problem  of  computing  the  output  given  the  output  size. 
While  it  would  be  possible  to  develop  separate  algorithms  to  determine  the 
output  size,  we  will  exhibit  below  a  general  scheme  that  enables  one  to 
determine  M  and  solve  the  problem  in  time  proportional  to  that  required 
for  solving  the  problem  given  M. 


4.1  Iterative  Guessing 


We will use a strategy of iterative guessing that depends on having an
algorithm, call it Algorithm A, with the following properties:


• The running time is t(N, Q), where Q is a given bound on the output
size.

• If $Q \ge M$, then the algorithm works correctly and produces the
appropriate output of size M.

• If $Q < M$, then the algorithm can detect the error (within time
t(N, Q)).

• t(N, Q) is monotonically nondecreasing in Q.

(These restrictions can be relaxed, but they are sufficient for our purposes.)

Using Algorithm A, we can create a new algorithm, Algorithm B, that
can solve the problem without knowing M beforehand. Algorithm B will
sequentially try the guesses $(g_1, g_2, \ldots)$. That is, it will first run Algorithm
A with $Q = g_1$. If this first guess is too small, it runs it with $Q = g_2$, then
$Q = g_3$, etc., until Algorithm A finally succeeds.
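The control flow of Algorithm B can be sketched as a simple driver loop. Everything named here is an assumption made for illustration: `algorithm_b`, the toy stand-in `make_toy_algorithm_a` (which reports failure by returning `None` when the bound Q is too small), and the doubling guess sequence, which is chosen for brevity and is not one of the sequences analysed in Section 4.2.

```python
def algorithm_b(algorithm_a, guesses):
    """Try successive bounds Q until algorithm_a succeeds.

    algorithm_a(q) is assumed to return the output when q >= M and
    None when it detects that q < M.
    """
    for q in guesses:
        result = algorithm_a(q)
        if result is not None:
            return result, q
    raise ValueError("guess sequence exhausted before success")

def make_toy_algorithm_a(pairs, f):
    """Toy stand-in for Algorithm A: a sequential Beta Operation with bound q."""
    def algorithm_a(q):
        blocks = {}
        for g, v in pairs:
            blocks[g] = f(blocks[g], v) if g in blocks else v
            if len(blocks) > q:
                return None          # more than q groups seen: Q < M detected
        return sorted(blocks.items())
    return algorithm_a

pairs = [(3, 1), (0, 2), (3, 4), (7, 5), (1, 1), (9, 2)]
toy_a = make_toy_algorithm_a(pairs, lambda a, b: a + b)
result, q_used = algorithm_b(toy_a, [1, 2, 4, 8, 16])
```

In this toy run there are five groups, so the guesses 1, 2 and 4 fail and the guess 8 succeeds; the total work is the sum of the costs of all four runs, which is the quantity the analysis below bounds.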

It is clear that an arbitrary choice of $g_i$'s will not result in an efficient
algorithm. Let us presumptuously define an efficiently-convergent guess
sequence and then justify the name. Let us denote the minimum output size
possible for any input by $M_{min}$. An efficiently-convergent guess sequence,
$(g_0, g_1, \ldots)$, for the function t(N, Q) is a sequence such that:

$$g_0 = M_{min}$$

$$c_1\, t(N, g_{i-1}) \le t(N, g_i) \le c_2\, t(N, g_{i-1})$$

$$1 < c_1 \le c_2$$

where $c_1$ and $c_2$ are independent of i, but can be chosen in a fashion that
depends on the sequence of $g_i$'s.

Theorem 4.1 Assume that we are given an algorithm for 'problem P given
M' that has all the properties enumerated above. Then if we are given a
corresponding efficiently-convergent guess sequence, we can create an algorithm
to solve 'P not given M', in time $\hat{t}(N, M)$ where $\hat{t}(N, M) = \Theta(t(N, M))$.

Proof: The given algorithm for 'P given M' is our 'Algorithm A'. To
solve the problem 'P not given M', we run 'Algorithm B'. Let $g_s$ be the
guess for Q on which Algorithm A finally succeeds.

From the properties of Algorithm A, it follows that $g_{s-1} < M \le g_s$.
Hence,

$$t(N, g_s) \le c_2\, t(N, g_{s-1}) \le c_2\, t(N, M).$$

Also,

$$c_1\, t(N, g_{i-1}) \le t(N, g_i)$$

$$\sum_{i=2}^{s} c_1\, t(N, g_{i-1}) \le \sum_{i=2}^{s} t(N, g_i)$$

$$t(N, g_1) + \sum_{i=1}^{s-1} (c_1 - 1)\, t(N, g_i) \le t(N, g_s)$$

$$\sum_{i=1}^{s} t(N, g_i) \le \frac{c_1}{c_1 - 1}\, t(N, g_s).$$

From the definition of our algorithm for 'P not given M',

$$\hat{t}(N, M) \le \sum_{i=1}^{s} t(N, g_i) \le \frac{c_1 c_2}{c_1 - 1}\, t(N, M).$$

Since it is trivially true that

$$\hat{t}(N, M) \ge t(N, M)$$

it follows that $\hat{t}(N, M) = \Theta(t(N, M))$. Note that the optimal choice of
$c_1 = c_2 = 2$ yields a factor of 4 slowdown in the worst case. ∎


4.2  Application  of  Method 

Lemma 4.1 For $t(N, Q) = c(\log N + \log^2 Q)$, the guess sequence $g_0 = 1$,
$g_i = 2^{\sqrt{(2^i - 1)\log N}}$ for $i > 0$ is efficiently-convergent with $c_1 = c_2 = 2$.


Since the algorithm described in Section 2 satisfies the properties
enumerated in Section 4.1, the corollary below follows from the above lemma
and Theorem 4.1.


Corollary 4.1 Beta Functions on |G| groups can be computed in time
$O(\log N + \log^2 |G|)$ on a hypercube, without prior knowledge of |G|.


Lemma 4.2 For $t(N, Q) = c(\log N + \sqrt{Q})$, the guess sequence $g_0 = 1$,
$g_i = ((2^i - 1)\log N)^2$ for $i > 0$ is efficiently-convergent with $c_1 = c_2 = 2$.
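The doubling behaviour claimed in Lemmas 4.1 and 4.2 can be checked numerically. The sketch below is only a sanity check under assumptions: it takes c = 1 and log N = 20, reads the guess sequences as $g_i = 2^{\sqrt{(2^i - 1)\log N}}$ and $g_i = ((2^i - 1)\log N)^2$, and uses hypothetical helper names throughout.

```python
import math

def t_hypercube(log_n, q):
    return log_n + math.log2(q) ** 2          # t(N, Q) = log N + log^2 Q, c = 1

def t_mot(log_n, q):
    return log_n + math.sqrt(q)               # t(N, Q) = log N + sqrt(Q), c = 1

def hypercube_guess(i, log_n):
    return 2.0 ** math.sqrt((2 ** i - 1) * log_n)    # equals 1 at i = 0

def mot_guess(i, log_n):
    return 1.0 if i == 0 else float(((2 ** i - 1) * log_n) ** 2)

log_n = 20
hypercube_ratios = [
    t_hypercube(log_n, hypercube_guess(i, log_n))
    / t_hypercube(log_n, hypercube_guess(i - 1, log_n))
    for i in range(1, 6)
]
mot_ratios = [
    t_mot(log_n, mot_guess(i, log_n)) / t_mot(log_n, mot_guess(i - 1, log_n))
    for i in range(1, 6)
]
```

With these readings each guess makes the time bound $2^i \log N$, so successive ratios are 2. The one exception in this instance is the first mesh-of-trees ratio, which comes out as 40/21 because $g_0 = 1$ contributes $\sqrt{1} = 1$ rather than 0; this affects only the constant.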




Since the algorithm described in Section 3 satisfies the properties
enumerated in Section 4.1, the corollary below follows from the above lemma
and Theorem 4.1.

Corollary 4.2 Beta Functions on |G| groups can be computed in time
$O(\log N + \sqrt{|G|})$ on a MOT, without prior knowledge of |G|.

5  Lower  Bounds 

In  this  section,  we  will  prove  some  lower  bounds,  given  our  formulation  of  the 
Beta  Operation,  and  relate  them  to  the  areas  and  times  for  the  algorithms 
and  architectures  discussed  above.  Note  that  while  in  the  other  sections  of 
this  paper  we  use  the  word  model  of  computation,  here  we  use  the  bit  model 
of  computation. 

The input is N pairs of numbers, $(g_0, v_0), (g_1, v_1), \ldots, (g_{N-1}, v_{N-1})$, each
of which is in the range 0 to N - 1. Let $G_i = \{v_j \mid g_j = i\}$. For all i such
that $|G_i| > 0$, we output $(i, y_i)$ in sorted order (according to i) where $y_i$ is
the F-reduction of $G_i$. As above, let $G = \{i \mid |G_i| > 0\}$. Let $w_0$ be the
smallest member of G; similarly, let $w_i$ ($i < |G|$) be the (i + 1)-th smallest
member of G. Define $z_i = y_{w_i}$; that is, $z_i$ is the y-value of the (i + 1)-th
output. We will refer to the j-th bit of $z_i$ as $z_{i,j}$. (Similarly for the g's.)

5.1  A  Lower  Bound  on  Area 

(The  structure  of  this  lower  bound  proof  closely  follows  the  one  for  sorting 
in  [U84].)  First  we  will  show  that 

Lemma 5.1 In a when- and where-determinate circuit that performs the
Beta Operation correctly for any |G|, none of the output bits $z_{i,j}$ (for
$i < N - 1$) can be output before all of the input bits $g_{i,j}$ (for $j > 0$) have been
read.

Proof: Assume, to the contrary, that $z_{p,q}$ ($p < N - 1$) is output before
$g_{r,s}$ ($s > 0$) is read. We construct two possible inputs. Fix every g and v,
except $g_r$, as follows:

• Choose a $t \ne r$. Set $g_t = p + 1$; $v_t = 2^q$.

• Set all other $v_i = 0$.

• For all i (other than i = t and i = r) choose a value of $g_i$ (different
from p and p + 1) in such a way that $\forall j$, $0 \le j < p$, $\exists i$ such that
$g_i = j$.

The two possible inputs yielded by setting either $g_r = p$ or $g_r = p \oplus 2^s$
produce different values for the bit $z_{p,q}$. Yet $z_{p,q}$ is output before $g_{r,s}$ is read,
a contradiction. ∎

Theorem 5.1 Any when- and where-determinate circuit that can perform
a Beta Operation on N inputs must have $A = \Omega(N \log N)$.

Proof: (This proof assumes N is even. The proof for N odd is similar.)
We will construct a family of inputs of size (N/2)!, each with different
outputs. For all i, fix $v_i = i$. For $i \ge N/2$, fix $g_i = 2i - N + 1$. Now,
we allow the remaining inputs, $g_0, \ldots, g_{N/2-1}$, to be any permutation of the
even numbers less than N.

Focus on the time just before the first $z_{i,j}$ (for $i < N - 1$) is output. The
circuit has already read all of the bits that differ between the elements of
our family of inputs. Hence, all inputs read subsequently will be the same
for any element of our family. Since the circuit must still produce (N/2)!
different outputs, it must have at least (N/2)! states and hence at least
$\log((N/2)!) = \Omega(N \log N)$ bits and area. ∎
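The last estimate in this proof follows from a standard bound on the factorial: the N/4 largest factors of (N/2)! are each at least N/4. Spelled out (assuming for simplicity that N is divisible by 4):

```latex
\begin{align*}
\log\left((N/2)!\right) &\ge \log\left((N/4)^{N/4}\right) \\
                        &= \frac{N}{4}\,\log\frac{N}{4} = \Omega(N \log N).
\end{align*}
```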

5.2 An $AT^2$ Lower Bound

(The  structure  of  this  lower  bound  proof  closely  follows  the  one  found  in 
[Ho85].) 

Theorem  5.2  Any  when-  and  where-determinate  circuit  that  can  perform 
Beta  Operations  for  any  M  =  |G|  requires  AT 2  =  ff(  AA/ log  A ),  where  A 
is  the  number  of  input  pairs. 

Proof: If there is an input pad that reads $\Omega(M^{1/2})$ bits of the $v_i$'s, then
$T = \Omega(M^{1/2})$. This, coupled with the above theorem, implies our $AT^2$
bound.

If no pad reads $\Omega(M^{1/2})$ bits, then we may partition our circuit into a

set  of  blocks  B  with  |2?|  =  Q(  —  ^ ,v  )  so  that 

•  each  block  in  B  has  area  0(  )  and  perimeter  4 

4 We use "perimeter" to mean the perimeter not including the external boundary of
the circuit.


• each block in $B$ reads at most $O(M)$ bits of the $v_i$'s.

Such  a  collection  of  blocks  is  obtained  by  first  cutting  the  circuit  into 
(-)(  —  1  )  blocks  each  of  perimeter  0(yj  7 )  ( Lemma  2.3  of  [U84]).  Then 

another  Q(  N  )  cuts  suffice  to  ensure  that  each  resulting  block  reads  at 
most $I_{\max} = O(M)$ bits of the $v_i$'s.

Two  statements  are  true  of  the  above  set  of  blocks: 

• At least half of these blocks produce less than twice the average number,
0(^), of the output $z_i$ bits for $i < M$.

• Let $I_{\mathrm{ave}}$ be defined as $N \log N / |B| = \Omega(M)$. More than half of the
blocks read at least $2I_{\mathrm{ave}} - I_{\max}$ input bits.

Note that we can stay within our block cutting rules and still make our
blocks small enough so that $2I_{\mathrm{ave}} - I_{\max} = \Omega(M)$. Thus, there is some
block, $b \in B$, that:

• reads at least $I_1 = \Omega(M)$ input bits from $I_2$ of the $v_i$'s (assume without
loss of generality that these are from $v_0, \ldots, v_{I_2 - 1}$);

•  has  perimeter  N^N)\ 

• produces $I_3$ = 0(^-) of the $z_i$ output bits with $i < M$.

We  may  then  construct  a  fooling  set  as  follows.  Let 

$V = \{(i,j) \mid b \text{ reads in bit } j \text{ of } v_i\}$

$Z = \{i \mid b \text{ outputs a bit of } z_i, \text{ and } i < M\}$.

For each $1 \le i \le I_2$, choose a distinct $\alpha_i$ such that $\alpha_i \notin Z$ and $0 \le \alpha_i < M$
(note that we can stay within our block cutting rules and still make our
blocks small enough so that $I_2 + I_3 < M - 1$). Let $c = M - I_2$. Choose
$\beta_1, \ldots, \beta_c$ (each $< M$) to be distinct from each other and the $\alpha_i$'s. Choose
the $g_i$'s as follows


And the $v_i$'s as


bit $j$ of $v_i$ = 0 or 1, for $(i,j) \in V$
bit $j$ of $v_i$ = 0, for $(i,j) \notin V$

Each of our $2^{I_1}$ choices for the input yields a different configuration of the
output bits outside of $b$. As $I_1 = \Omega(M)$, the fooling set is of size $2^{\Omega(M)}$. This
yields the bound; that is, $AT^2 = \Omega(NM \log N)$. ∎

6  Conclusions 

We showed in Section 2 that the Beta Function Problem can be solved
in time $O(\log N + \log^2 |G|)$ on a hypercube. We can achieve this same
time bound on a shuffle-exchange graph. In Section 3, we showed that the
problem can be solved in time $O(\sqrt{|G|} + \log N)$ on a mesh-of-trees. The
resulting $AT^2$ bound of $N \log^2 N \,(|G| + \log^2 N)$ is within a few $\log N$ factors
of the lower bound of $AT^2 = \Omega(N |G| \log N)$ shown in Section 5, even after
accounting for the word model vs. bit model distinction. In Section 4 we
showed that in a wide variety of cases, including the ones above, the time
bounds can be achieved even when $|G|$ is not known beforehand.
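One way to see where the "few $\log N$ factors" come from is to divide the upper bound by the lower bound. The sketch below assumes that the mesh-of-trees layout occupies area $A = O(N \log^2 N)$, which is consistent with the stated bound but is not restated in this section, and it ignores the word model vs. bit model distinction.

```latex
\begin{align*}
T^2 &= O\bigl((\sqrt{|G|} + \log N)^2\bigr) = O(|G| + \log^2 N), \\
AT^2 &= O\bigl(N \log^2 N \,(|G| + \log^2 N)\bigr), \\
\frac{N \log^2 N \,(|G| + \log^2 N)}{N\,|G| \log N}
  &= \log N \left(1 + \frac{\log^2 N}{|G|}\right).
\end{align*}
```

Under these assumptions the gap is a single $\log N$ factor once $|G| \ge \log^2 N$, and at most $O(\log^3 N)$ for smaller $|G|$.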

Several variations on the Beta Function Problem are possible. As described
in Section 1, at the end of the Beta Operation, either the $|G|$ nontrivial
$(i, y_i)$ pairs end up in the first $|G|$ processors or else each $y_i$ appears
at processor $i$. In some applications it may be appropriate to end the
computation with every processor holding the reduction, $y_g$, corresponding to
the group of its original $(g, v)$-pair. This problem is like computing $|G|$
independent census functions [LV81] in parallel. Let us call it the Beta/Census
Problem. This can be implemented by first computing the $y_g$'s and then
running the Beta Operation in reverse. One can run the Beta Operation in
reverse if a trace of the forward Beta Operation was stored in the network.
In general, this may be costly in terms of memory (= area). Fortunately,
the algorithms demonstrated in this paper can be augmented to solve the
Beta/Census Problem with only a constant factor increase in time and area.
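The input/output behavior of the Beta/Census Problem can be stated as a short sequential reference. This sketch only specifies what each processor should hold at the end; it is not the network algorithm (forward pass plus reversed trace) described above. Addition as the reduction function and the name `beta_census` are assumptions made for illustration.

```python
def beta_census(groups, values):
    """Return, for each processor i, the reduction y_g of the group
    g = groups[i] that processor i's original (g, v)-pair belongs to.
    Assumed reduction function: addition."""
    # Forward step: compute the reduction y_g of every group.
    totals = {}
    for g, v in zip(groups, values):
        totals[g] = totals.get(g, 0) + v
    # Reverse step: hand each processor the reduction of its own group.
    return [totals[g] for g in groups]
```

For example, with groups `[2, 0, 2, 1]` and values `[5, 7, 1, 4]`, processors 0 and 2 share group 2 and both end up holding the same reduction.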

It is interesting to note that the Beta Operation can be done probabilistically
on the hypercube in time $O(\log N)$. Remember that the $O(\log^2 |G|)$
term comes solely from the sorting subroutine in Phase i. If we use Flashsort
[RV83] to sort, the bound obviously follows.